Abstract

A 6.4 kbps Improved Multi-Band Excitation (IMBE) speech 
coder is presented. This speech coder combines high speech 
quality with a robustness to channel impairments which is 
necessary for successful operation in a mobile communica- 
tion environment. MOS results for the IMBE speech coder 
are compared against those of four 6.4 kbps CELP based 
speech coders which were tested as part of the INMARSAT- 
M voice codec evaluation. The IMBE system yielded the best 
performance of the systems which were tested. It received an 
MOS score of 3.4 at both 0% and 1% BER. The test results 
show that the IMBE system is a viable alternative to CELP 
based speech coders. 


1 Introduction 

One application of low-rate speech coders is mobile commu- 
nications. In order for a speech coder to be successful in the 
mobile communication environment, it must combine high 
speech quality with a robustness to channel impairments. 
This combination of performance requirements presents a dif- 
ficult problem for low-rate speech coders. 

There are currently many different speech coders oper- 
ating at rates between 4 kbps and 8 kbps [7]. A major- 
ity of these speech coders are variants of the Code Excited 
Linear Prediction (CELP) speech coder presented in [2,8]. 
The Improved Multi-Band Excitation (IMBE) speech coder 
is based upon an alternative technology, which was origi- 
nally developed at M.I.T. [4,6]. The IMBE speech coder is a 
model-based system which is capable of producing high qual- 
ity speech at rates down to 2.4 kbps. In addition it has been 
implemented in real-time using an AT&T DSP32 or DSP32C 
processor [1]. 

Previous work on the IMBE speech coder had focussed on 
improving the quality of the synthesized speech. Little work 
had been done on improving the robustness of the system to 
channel impairments. As a consequence the IMBE system 
suffered significant degradation as the Bit Error Rate (BER) 
approached 1%. In order to solve this problem a number of
modifications have been made to the encoding and decod- 
ing algorithms. These modifications include a more robust 
quantization method, the inclusion of forward error correction,
and the addition of adaptive smoothing. The primary
focus of this paper is the use of these techniques in the de- 
velopment of a 6.4 kbps IMBE system. The performance 
objectives were to operate at 1% BER with little degrada- 
tion and at 4% BER with moderate degradation. Although 
the focus of this paper is the development of a 6.4 kbps sys- 
tem, the techniques which are presented here can be applied 
to other IMBE speech coders operating at rates as low as 2.4 
kbps. 

The organization of this document is as follows. Section 2 
reviews the Multi-Band Excitation (MBE) speech model. Sec- 
tion 3 discusses an improved method for quantizing the MBE 
model parameters. Section 4 discusses an efficient method for 
applying forward error correction, and Section 5 discusses an 
adaptive smoothing algorithm which reduces the perceived 
effect of some uncorrectable bit errors. Section 6 presents 
Mean Opinion Scores (MOS) comparing the new 6.4 kbps 
IMBE system with several CELP based speech coders oper- 
ating at the same rate. The paper concludes with Section 7. 

2 Multi-Band Excitation Speech Model 

The MBE speech model was developed by Griffin and Lim in 
1984 [3]. This model uses a more flexible representation of the 
speech excitation than traditional speech models. As a con- 
sequence it is able to produce more natural sounding speech, 
and it is more robust to the presence of acoustic background 
noise. These properties have made the MBE speech model a 
prime framework for the development of high quality low-rate 
speech coders. 

Let s(n) denote a discrete speech signal obtained by sam- 
pling an analog speech signal. In order to focus attention on 
a short segment of speech over which the model parameters 
are assumed to be constant, the signal s(n) is multiplied by a 
window w(n) to obtain a windowed speech segment or frame, 
$s_w(n)$. The speech segment $s_w(n)$ is modelled as the response
of a linear filter $h_w(n)$ to some excitation signal $e_w(n)$. Therefore,
the Fourier Transform of $s_w(n)$, $S_w(\omega)$, can be expressed as

$$S_w(\omega) = H_w(\omega) E_w(\omega) \qquad (1)$$

where $H_w(\omega)$ and $E_w(\omega)$ are the Fourier Transforms of $h_w(n)$
and $e_w(n)$, respectively. The spectrum $H_w(\omega)$ is often referred
to as the spectral envelope of the speech segment.

In traditional speech models speech is divided into two
classes depending upon the nature of the excitation signal.
For voiced speech the excitation signal is a periodic impulse
sequence, where the distance between impulses is the pitch
period. For unvoiced speech the excitation signal is a white
noise sequence.

Figure 1: IMBE Speech Coder

In traditional speech models each speech segment is classi- 
fied as either entirely voiced or entirely unvoiced. In contrast 
the MBE speech model divides the excitation spectrum into 
a number of non-overlapping frequency bands and makes a 
voiced or unvoiced (V/UV) decision for each frequency band. 
This approach allows the excitation signal for a particular 
speech segment to be a mixture of periodic (voiced) energy 
and aperiodic (unvoiced) energy. This added flexibility in 
the modelling of the excitation signal allows the MBE speech 
model to produce high quality speech and to be robust to the 
presence of background noise. 

Speech coders based on the MBE speech model use an 
algorithm to estimate a set of model parameters for each seg- 
ment of speech. The MBE model parameters consist of a 
fundamental frequency, a set of V/UV decisions which char- 
acterize the excitation signal, and a set of spectral amplitudes 
which characterize the spectral envelope. Once the MBE 
model parameters have been estimated for each segment, they 
are quantized and transmitted to the decoder. The decoder 
then reconstructs the model parameters and synthesizes a 
speech signal from the MBE model parameters. Algorithms 
which have been developed for estimation and synthesis of the 
model parameters are presented in [3,4]. A block diagram of 
a typical speech coder based on the MBE speech model is 
shown in Figure 1.

3 Parameter Quantization 

Efficient methods for quantizing the MBE model parameters 
have been presented in [4,5,6]. However, significant modifi- 
cations to these algorithms were required in order to improve 
the robustness of the system to channel impairments. 

For a 6.4 kbps speech coder using a 50 Hz frame rate,
128 bits are available per frame. In the IMBE system 45
of these bits are reserved for forward error correction, as is
discussed in Section 4. The remaining 83 bits per frame are
used to quantize the MBE model parameters, which consist
of a fundamental frequency $\omega_0$, a set of V/UV decisions $v_k$
for $1 \le k \le K$, and a set of spectral amplitudes $M_l$ for
$1 \le l \le L$. The values of K and L vary depending on the
fundamental frequency of each frame. The 83 available bits
are divided among the model parameters as shown in Table 1.

Parameter                    Number of Bits
Fundamental Frequency        8
Voiced/Unvoiced Decisions    K
Spectral Amplitudes          75 - K

Table 1: Bit Allocation Among Model Parameters
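The frame budget described above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name and the dictionary keys are illustrative and do not come from the paper.

```python
def frame_bit_budget(num_bands, bit_rate_bps=6400, frame_rate_hz=50, fec_bits=45):
    """Return (total bits, model bits, allocation) for one speech frame."""
    total_bits = bit_rate_bps // frame_rate_hz      # 6400 / 50 = 128
    model_bits = total_bits - fec_bits              # 128 - 45 = 83
    allocation = {
        "fundamental_frequency": 8,                 # fixed 8 bit pitch index
        "vuv_decisions": num_bands,                 # one bit per band, K bits
        # Whatever is left goes to the spectral amplitudes (75 - K at 6.4 kbps).
        "spectral_amplitudes": model_bits - 8 - num_bands,
    }
    return total_bits, model_bits, allocation


total, model, alloc = frame_bit_budget(num_bands=12)
print(total, model, alloc)
```

With the maximum of K = 12 bands, 63 bits remain for the spectral amplitudes; with fewer bands the bits that are freed are given to the amplitudes.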

The fundamental frequency is quantized by first converting
it to its equivalent pitch period using Equation (2).

$$P_0 = \frac{2\pi}{\omega_0} \qquad (2)$$

The value of $P_0$ is typically restricted to the range $20 \le P_0 \le 120$,
assuming an 8 kHz sampling rate. In the 6.4 kbps IMBE
system this parameter is uniformly quantized using 8 bits and
a step size of .5. This corresponds to a pitch accuracy of one
half sample.
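A uniform pitch quantizer of this kind might look as follows. This is a sketch under stated assumptions: the fundamental frequency is taken in radians per sample, so the pitch period is 2*pi divided by it, and the mapping of index 0 to a period of 20 samples is a choice made here for illustration. The paper does not give the index assignment.

```python
import math

P_MIN, P_MAX, STEP = 20.0, 120.0, 0.5   # range and step size quoted in the text


def quantize_pitch(omega0):
    """Map a fundamental frequency (radians/sample) to an 8 bit index."""
    period = 2.0 * math.pi / omega0                 # pitch period in samples
    period = min(max(period, P_MIN), P_MAX)         # clamp to the allowed range
    return int(math.floor((period - P_MIN) / STEP + 0.5))


def dequantize_pitch(index):
    """Recover the fundamental frequency from the index (assumed layout)."""
    period = P_MIN + index * STEP
    return 2.0 * math.pi / period


index = quantize_pitch(2.0 * math.pi / 57.3)
print(index, 2.0 * math.pi / dequantize_pitch(index))
```

The range 20 to 120 in steps of .5 needs 201 levels, which fits in 8 bits, and the reconstructed period is never more than a quarter sample away from the clamped input.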

The K V/UV decisions are binary values. Therefore they 
can be encoded using a single bit per decision. The 6.4 kbps 
system uses a maximum of 12 decisions, and the width of 
each frequency band is equal to $3\omega_0$. The width of the highest
frequency band is adjusted to include frequencies up to 3.8 
kHz. 
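The band layout can be sketched as below. The paper states only the band width (three harmonics), the maximum of 12 decisions, and the 3.8 kHz upper edge. The rule used here for counting harmonics (every multiple of the fundamental up to 3.8 kHz) and the function name are assumptions made for illustration.

```python
import math


def vuv_bands(f0_hz, max_bands=12, harmonics_per_band=3, upper_hz=3800.0):
    """Return (first_harmonic, last_harmonic) for each V/UV band."""
    # Assumption: harmonics are counted up to the 3.8 kHz band edge.
    num_harmonics = int(upper_hz // f0_hz)
    num_bands = min(max_bands, math.ceil(num_harmonics / harmonics_per_band))
    bands = []
    for k in range(num_bands):
        first = k * harmonics_per_band + 1
        last = min((k + 1) * harmonics_per_band, num_harmonics)
        if k == num_bands - 1:
            last = num_harmonics        # highest band is widened to reach 3.8 kHz
        bands.append((first, last))
    return bands


print(vuv_bands(100.0))   # low pitch: 12 bands, the last one widened
print(vuv_bands(200.0))   # higher pitch: fewer bands
```

For a low fundamental the 12 band limit is reached and the last band absorbs the remaining harmonics. For a higher fundamental fewer than 12 decisions are needed, which is why K varies from frame to frame.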

The spectral amplitudes are quantized by forming a set 
of prediction residuals. Each prediction residual is the dif- 
ference between the logarithm of the spectral amplitude for 
the current frame and the logarithm of the spectral amplitude 
representing the same frequency in the previous speech frame. 
The spectral amplitude prediction residuals are then divided 
into six blocks each containing approximately the same num- 
ber of prediction residuals. Each of the six blocks is then 
transformed with a Discrete Cosine Transform (DCT) and 
the D.C. coefficients from each of the six blocks are com- 
bined into a 6 element Prediction Residual Block Average 
(PRBA) vector. The mean of the PRBA vector is subtracted from the
vector and quantized separately using a 6 bit non-uniform quantizer.
The zero-mean PRBA vector is then vector quantized us- 
ing a 10 bit vector quantizer. The 10 bit PRBA codebook 
was designed using a k-means clustering algorithm on a large 
training set consisting of zero-mean PRBA vectors from a va- 
riety of speech material. The higher-order DCT coefficients 
which are not included in the PRBA vector are quantized 
with scalar uniform quantizers using the 59 - K remaining
bits. The bit allocation and quantizer step sizes are based 
upon the long-term variances of the higher order DCT coef- 
ficients. A block diagram of the spectral amplitude quanti- 
zation algorithm is shown in Figure 2. 
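The decomposition into a PRBA mean, a zero-mean PRBA vector and higher-order DCT coefficients can be sketched as follows. Quantization itself is omitted. Several details are assumptions made for illustration: the previous frame's log amplitudes are taken as already aligned to the current harmonic frequencies, the DCT is scaled so that its first coefficient equals the block average, and any extra residuals are placed in the later blocks.

```python
import math


def dct_ii(block):
    """DCT-II scaled by 1/N, so coefficient 0 is the block average."""
    n = len(block)
    return [sum(block[j] * math.cos(math.pi * (j + 0.5) * k / n) for j in range(n)) / n
            for k in range(n)]


def split_blocks(values, num_blocks=6):
    """Split into num_blocks pieces of nearly equal length (needs len >= num_blocks)."""
    base, extra = divmod(len(values), num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        size = base + (1 if b >= num_blocks - extra else 0)
        blocks.append(values[start:start + size])
        start += size
    return blocks


def prba_decompose(log_amps, aligned_prev_log_amps):
    """Return (PRBA mean, zero-mean PRBA vector, higher-order DCT coefficients)."""
    residuals = [cur - prev for cur, prev in zip(log_amps, aligned_prev_log_amps)]
    coeffs = [dct_ii(block) for block in split_blocks(residuals)]
    prba = [c[0] for c in coeffs]                    # one D.C. term per block
    mean = sum(prba) / len(prba)
    zero_mean_prba = [value - mean for value in prba]
    higher_order = [c[1:] for c in coeffs]           # left for the scalar quantizers
    return mean, zero_mean_prba, higher_order


mean, zero_mean, higher = prba_decompose([float(i) for i in range(12)], [0.0] * 12)
print(mean, zero_mean)
```

In the coder described here the mean would go to the 6 bit quantizer, the zero-mean vector to the 10 bit vector quantizer, and the higher-order coefficients to the scalar quantizers.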

There are several advantages to this quantization method.
First, it provides very good fidelity using a small number of
bits and it maintains this fidelity as L varies over its range.
In addition, the computational requirements of this approach
are well within the limits required for real-time implementation
using a single DSP such as the AT&T DSP32C. Finally,
this quantization method separates the spectral amplitudes
into a few components, such as the mean of the PRBA vector,
which are sensitive to bit errors, and a large number
of other components which are not very sensitive to bit errors.
Forward error correction can then be used in an efficient
manner by providing a high degree of protection for the few
sensitive components and a lesser degree of protection for the
remaining components. This is discussed in the next section.

Figure 2: Spectral Amplitude Quantization

4 Forward Error Correction 

The 45 bits per frame which are reserved for forward error 
correction are divided among [23,12] Golay codes which can 
correct up to 3 errors, [15,11] Hamming codes which can correct
single errors, and parity bits. The six most significant
bits from the fundamental frequency and the three most sig- 
nificant bits from the mean of the PRBA vector are first 
combined with three parity check bits and then encoded in a 
[23,12] Golay code. A second Golay code is used to encode 
the three most significant bits from the PRBA vector and 
the nine most sensitive bits from the higher order DCT coef- 
ficients. All of the remaining bits except the seven least sen- 
sitive bits are then encoded into five [15,11] Hamming codes. 
The seven least sensitive bits are not protected with error
correction codes. 
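The allocation above can be verified by counting bits. The short sketch below (names are illustrative) confirms that two Golay codewords, five Hamming codewords and seven unprotected bits fill the 128 bit frame, and that the redundancy, including the three parity check bits, comes to 45 bits.

```python
GOLAY = (23, 12)      # (codeword length, information bits)
HAMMING = (15, 11)


def fec_accounting(num_golay=2, num_hamming=5, unprotected=7, parity_bits=3):
    """Return (transmitted bits, model parameter bits, redundancy bits)."""
    codes = [GOLAY] * num_golay + [HAMMING] * num_hamming
    transmitted = sum(n for n, _ in codes) + unprotected
    payload = sum(k for _, k in codes) + unprotected
    # The parity check bits ride inside the first Golay codeword, so they
    # occupy payload positions without carrying model parameters.
    model_bits = payload - parity_bits
    redundancy = transmitted - model_bits
    return transmitted, model_bits, redundancy


print(fec_accounting())   # (128, 83, 45)
```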

Prior to transmission the 128 bits which represent a par- 
ticular speech segment are interleaved such that at least five 
bits separate any two bits from the same codeword. This
feature spreads the effect of short burst errors over several 
different codewords, thereby increasing the probability that 
the errors can be corrected. 

At the decoder the received bits are passed through Go- 
lay and Hamming decoders which attempt to remove any bit 
errors from the data bits. The three parity check bits are 
checked and if no errors are detected then the received bits 
are used to reconstruct the MBE model parameters for the 
current frame. Otherwise if an error is detected then the re- 
ceived bits for the current frame are ignored and the model 
parameters from the previous frame are repeated for the cur- 
rent frame. 

The Golay and Hamming decoders also provide informa- 
tion on the number of correctable bit errors in the data. This 
information is used by the decoder to estimate the bit er- 
ror rate. The estimate of the bit error rate is used to control 
adaptive smoothers which increase the perceived speech qual- 
ity in the presence of uncorrectable bit errors [2]. An adaptive 
smoothing algorithm which is applied to the V/UV decisions 
is the subject of the next section. 
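To illustrate how a decoder can both correct errors and report how many it corrected, the sketch below implements a textbook [15,11] Hamming code. The parity bit positions (1, 2, 4 and 8) and the bit ordering are the standard construction and are an assumption here. The paper does not specify the code layout or the rule that turns error counts into a bit error rate estimate.

```python
PARITY_POSITIONS = (1, 2, 4, 8)
DATA_POSITIONS = [p for p in range(1, 16) if p not in PARITY_POSITIONS]


def syndrome(word):
    """XOR of the 1-based positions of all set bits; zero for a valid codeword."""
    result = 0
    for position in range(1, 16):
        if word[position - 1]:
            result ^= position
    return result


def hamming_encode(data):
    """Encode 11 data bits into a 15 bit codeword."""
    if len(data) != 11:
        raise ValueError("expected 11 data bits")
    word = [0] * 15
    for position, bit in zip(DATA_POSITIONS, data):
        word[position - 1] = bit
    partial = syndrome(word)
    for p in PARITY_POSITIONS:
        if partial & p:
            word[p - 1] = 1      # setting this bit cancels bit p of the syndrome
    return word


def hamming_decode(word):
    """Return (11 data bits, number of bit errors corrected: 0 or 1)."""
    word = list(word)
    position = syndrome(word)
    corrected = 0
    if position:
        word[position - 1] ^= 1  # a nonzero syndrome names the bit to flip
        corrected = 1
    return [word[p - 1] for p in DATA_POSITIONS], corrected


data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1]
received = hamming_encode(data)
received[6] ^= 1                 # inject a single channel error
print(hamming_decode(received))
```

Summing the corrected counts over all codewords in a frame gives the kind of per-frame error count that an error rate estimator could average over time.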


5 Smoothing the V/UV Decisions 

In order to improve the perceived speech quality in the pres- 
ence of uncorrectable bit errors, the decoder adaptively per- 
turbs the V/UV decisions. Uncorrectable bit errors in the 
V/UV decisions tend to create large distortions in the syn- 
thesized speech as frequency bands are switched from voiced 
to unvoiced or unvoiced to voiced. These distortions are 
most apparent for spectral amplitudes which contain a large 
amount of energy. The adaptive smoothing algorithm reduces 
the switching distortions for high energy spectral amplitudes. 
The algorithm exploits the fact that high energy spectral am- 
plitudes are more likely to be voiced than unvoiced. In the 
case where the estimated error rate $\epsilon_R$ rises above a predetermined
threshold, all high energy spectral amplitudes are
forced to be voiced regardless of the received V/UV decisions. 

The adaptive smoothing algorithm operates if $\epsilon_R > .003$.
In that case each spectral amplitude is compared with an 
adaptive threshold and if the spectral amplitude is greater 
than the threshold then that spectral amplitude is declared 
to be voiced. If the spectral amplitude is less than the thresh- 
old then the decoded V/UV decision is left unchanged. The 
adaptive threshold, denoted $M_T$, is a function of the error
rate and the local average of the energy in the decoded spectral
amplitudes. It can be computed according to Equation
(3), where the local average of the energy, $S_E$, is updated
each frame according to Equation (4).

$$M_T = \begin{cases} \dfrac{45.255\,(S_E)^{.375}}{\exp(173.29\,\epsilon_R)} & \epsilon_R < .02 \\ 1.414\,(S_E)^{.375} & \text{otherwise} \end{cases} \qquad (3)$$

$$S_E = .95\,S_E + .05 \sum_{l=1}^{L} M_l^2 \qquad (4)$$


Listening tests have revealed that in the presence of un- 
correctable bit errors in the V/UV decisions this adaptive 
smoothing algorithm significantly increases the perceived qual- 
ity of the synthesized speech. In addition this algorithm does 
not degrade the speech in the absence of bit errors since it is 
switched off if $\epsilon_R < .003$.
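The smoothing rule of this section can be sketched as follows, using the constants of Equations (3) and (4). Two points are assumptions made for illustration: the energy of a frame is taken to be the sum of the squared spectral amplitudes, and each spectral amplitude is given its own voiced flag (inherited from its band) so that it can be overridden individually.

```python
import math


def update_local_energy(previous_energy, amplitudes):
    """Equation (4): slowly varying average of the frame energy."""
    return 0.95 * previous_energy + 0.05 * sum(m * m for m in amplitudes)


def adaptive_threshold(error_rate, local_energy):
    """Equation (3): the threshold falls as the estimated error rate rises."""
    scale = local_energy ** 0.375
    if error_rate < 0.02:
        return 45.255 * scale / math.exp(173.29 * error_rate)
    return 1.414 * scale


def smooth_vuv(amplitudes, voiced, error_rate, local_energy):
    """Force high energy amplitudes to voiced when the error rate is high."""
    if error_rate <= 0.003:
        return list(voiced)                  # smoothing is switched off
    threshold = adaptive_threshold(error_rate, local_energy)
    return [True if m > threshold else v for m, v in zip(amplitudes, voiced)]


energy = update_local_energy(1.0, [1.0, 0.5])
print(energy, smooth_vuv([2.0, 1.0], [False, False], 0.05, 1.0))
```

The two branches of the threshold meet at an error rate of .02, since 45.255 / exp(173.29 x .02) is approximately 1.414, so the threshold is continuous in the estimated error rate.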

6 Test Results 

A 6.4 kbps IMBE speech coder incorporating the improve- 
ments described in this paper has been implemented in real- 
time using a single AT&T DSP32C processor. This sys- 
tem has been evaluated as a voice coding candidate for the 
INMARSAT-M mobile satellite communication system. As 
part of the evaluation Telecom Australia Research Laborato- 
ries conducted extensive tests of this system and six other 6.4 
kbps CELP based speech coders. The tests were conducted 
using CCITT guidelines for MOS testing. The results for the
five best candidates, averaged over different input levels, are
shown in Figure 3. MOS scores are shown for 5 different test 
conditions which consist of 0% BER, .1% BER, 1% BER, 4% 
BER and a burst error condition which is typical in mobile 
communications. The results for the INMARSAT standard 
9.6 kbps system at 0% BER and .1% BER are also shown in 
this figure. 
Figure 3: Spectral Amplitude Quantization


The test results reveal several important facts. The first 
is that in the absence of bit errors the IMBE system provides 
higher speech quality than all of the other 6.4 kbps systems 
which were tested. In addition the IMBE system performed 
virtually as well as the 9.6 kbps system. The results also 
show that the IMBE system is more robust to channel im- 
pairments than the other systems which were evaluated. This 
is evidenced by the fact that as the bit error rate increases the
performance advantage of the IMBE system over the other systems
gradually increases.

7 Conclusions 

The primary focus of this paper has been the development of 
a 6.4 kbps IMBE speech coder which is well suited for mo- 
bile communications. MOS testing of this system has shown 
that it provides high speech quality and that it is robust to 
channel impairments. The IMBE speech coder is clearly a vi- 
able alternative to CELP based speech coders. Recently the 
IMBE system has been selected as the voice coding standard 
for the INMARSAT-M and AUSSAT mobile satellite com- 
munication systems. Future work will focus on the inclusion 
of soft decision decoding, more efficient implementations and 
the development of variable rate systems. 